import pandas as pd
from pathlib import PosixPath
Extracting Role Titles
Our goal is to extract role information from job ads to try to understand the job ads better. This is a pretty complex task: role titles are hidden in the text, and can be very ambiguous (“Manager”) or very specific (“Subsea Cabling Engineer”). This notebook scopes out the problem and looks at extracting common examples of role titles.
Load in the Data
Get the data from the Adzuna Job Salary Prediction Kaggle competition, put it in the data subfolder, and unzip all the files.
You can do this manually, or use the Kaggle API (once you’ve installed the API, downloaded your kaggle.json file, and agreed to the competition rules).
# for split, ext in [('Test', 'zip'), ('Train', 'zip'), ('Valid', 'csv')]:
# !kaggle competitions download -c job-salary-prediction --path data/ -f {split}_rev1.{ext}
# !find data/ -name '*.zip' -execdir unzip '{}' ';'
# !find data/ -name '*.zip' -exec rm '{}' ';'
# !ls data/
%%time
dfs = []
for split in ['Train', 'Valid', 'Test']:
    dfs.append(pd.read_csv(f'data/{split}_rev1.csv').assign(split=split))
df = pd.concat(dfs, sort=False, ignore_index=True)
df['Title'] = df['Title'].fillna('')
del dfs
CPU times: user 6.52 s, sys: 3.06 s, total: 9.58 s
Wall time: 12.6 s
len(df)
407894
pd.options.display.max_columns = 200
pd.options.display.max_colwidth = 100
The job titles mix several different kinds of information:
- Roles: “Engineering Systems Analyst”, “Stress engineer”, “Subsea cables engineer”
- Location like “Glasgow” or “East Midlands”
- Seniority like “Senior”, “Principal”, “Lead”, or “Trainee”
- Industry like “Pharmaceutical” or “Construction”
- Selling points/working conditions of the job: “Award Winning Restaurant”, “Excellent Tips”, “Self Employed”, “does it get any better than this?”
- Company names: “Nevill Crest and Gun”, “The Refectory”
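As a rough illustration of pulling one of these components out of a title, the sketch below tags the seniority words listed above and returns the remainder. The helper name `split_seniority` and the word list are assumptions made for this example, not something the notebook defines, and words like “Lead” can also be part of a role, so this is only a first pass.

```python
import re

# Seniority words taken from the examples above; not an exhaustive list.
SENIORITY = ["Senior", "Principal", "Lead", "Trainee"]
SENIORITY_RE = re.compile(r"\b(" + "|".join(SENIORITY) + r")\b", re.IGNORECASE)


def split_seniority(title):
    """Return (seniority terms found, title with those terms removed)."""
    found = [m.group(0).title() for m in SENIORITY_RE.finditer(title)]
    # Remove the matched words, then collapse any leftover double spaces.
    rest = re.sub(r"\s+", " ", SENIORITY_RE.sub("", title)).strip()
    return found, rest


print(split_seniority("Senior Subsea Pipeline Integrity Engineer"))
print(split_seniority("Trainee Mortgage Advisor East Midlands"))
```

The word boundaries mean “Market Leading Retailer” is not tagged as “Lead”, but location, industry, and company names would each need their own treatment.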
Sometimes there are multiple roles (often multiple descriptions of the same role):
- Engineering Systems Analyst / Mathematical Modeller
- Electrical / ICA Engineer
Sometimes it’s ambiguous: is “Modelling and simulation analyst” one role or two (“modelling analyst” and “simulation analyst”)? The same question applies to “C/C++ developer”. Is “Bilingual Reservationist” a role title, or is the role just “Reservationist” with “Bilingual” as a skill required for the job?
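One simple heuristic for the slash-separated cases, sketched below under the assumption that a slash with whitespace beside it separates alternatives while a tight slash (as in “C/C++”) belongs to a single term. The function `split_roles` is a hypothetical helper for illustration only.

```python
import re

# Split only on slashes that have whitespace on at least one side,
# so "C/C++ Developer" is left intact.
SEPARATOR = re.compile(r"\s+/\s*|\s*/\s+")


def split_roles(title):
    """Return the candidate role strings in a title."""
    parts = SEPARATOR.split(title)
    return [part.strip() for part in parts if part.strip()]


for title in [
    "Engineering Systems Analyst / Mathematical Modeller",
    "Electrical / ICA Engineer",
    "C/C++ Developer",
]:
    print(split_roles(title))
```

Note that “Electrical / ICA Engineer” splits into “Electrical” and “ICA Engineer”; recovering “Electrical Engineer” would need knowledge that the head noun is shared, which this sketch does not attempt.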
To understand the job we’ll also need to understand some of the acronyms like:
- MICE Sales: Meetings, incentives, conferences and exhibitions
- ICA Engineer: Instrumentation Control and Automation
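A lookup table is one way such acronyms could be expanded. The sketch below uses only the two expansions given above; the `ACRONYMS` table and `expand_acronyms` helper are illustrative assumptions, and a real table would need to be curated by hand.

```python
import re

# Illustrative lookup built from the two examples above.
ACRONYMS = {
    "MICE": "Meetings, Incentives, Conferences and Exhibitions",
    "ICA": "Instrumentation Control and Automation",
}


def expand_acronyms(title, acronyms=ACRONYMS):
    """Replace whole-word, upper-case acronyms with their expansions."""

    def replace(match):
        word = match.group(0)
        # Unknown upper-case words are returned unchanged.
        return acronyms.get(word, word)

    return re.sub(r"\b[A-Z]{2,}\b", replace, title)


print(expand_acronyms("Electrical / ICA Engineer"))
print(expand_acronyms("QA Engineer"))
```

Matching is case-sensitive on purpose, so an ordinary lower-case word is never mistaken for an acronym; titles written entirely in capitals pass through unchanged unless a word happens to be in the table.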
df.Title.head(50).reset_index()
| | index | Title |
---|---|---|
0 | 0 | Engineering Systems Analyst |
1 | 1 | Stress Engineer Glasgow |
2 | 2 | Modelling and simulation analyst |
3 | 3 | Engineering Systems Analyst / Mathematical Modeller |
4 | 4 | Pioneer, Miser Engineering Systems Analyst |
5 | 5 | Engineering Systems Analyst Water Industry |
6 | 6 | Senior Subsea Pipeline Integrity Engineer |
7 | 7 | RECRUITMENT CONSULTANT INDUSTRIAL / COMMERCIAL / ENGINEERING / DRIV |
8 | 8 | RECRUITMENT CONSULTANT CONSTRUCTION / TECHNICAL / TRADES LABOUR |
9 | 9 | Subsea Cables Engineer |
10 | 10 | Trainee Mortgage Advisor East Midlands |
11 | 11 | PROJECT ENGINEER, PHARMACEUTICAL |
12 | 12 | Principal Composite Stress Engineer |
13 | 13 | Senior Fatigue Damage Tolerance Engineer |
14 | 14 | Chef de Partie Award Winning Restaurant Excellent Tips |
15 | 15 | Quality Engineer |
16 | 16 | Principal Controls Engineer |
17 | 17 | Chef de Partie Award Winning Dining Live In Share of Tips |
18 | 18 | Senior Fatigue and Damage Tolerance Engineer |
19 | 19 | C I Design Engineer |
20 | 20 | Lead Engineers (Stress) |
21 | 21 | Relief Chef de Partie Croydon, Surrey Live in |
22 | 22 | Senior Control and Instrumentation Engineer |
23 | 23 | Control and Instrumentation Engineer |
24 | 24 | Electrical / ICA Engineer |
25 | 25 | Pastry Chef for **** red star **** rosette hotel **** |
26 | 26 | Senior Process Engineer |
27 | 27 | CHEF DE PARTIE POSITION IN **** ROSETTE HOTEL NYORKS ****k |
28 | 28 | Senior Sous Chef for **** rosette kitchen, up to **** |
29 | 29 | General Manager Funky, Cool Restaurant Concept London ****k |
30 | 30 | MICE Sales and Marketing Manager |
31 | 31 | C/C++ Developer |
32 | 32 | Senior PHP Developer |
33 | 33 | Senior Website Designer |
34 | 34 | Business Development Manager |
35 | 35 | Welwyn Chef de Partie does it get any better than this? **** |
36 | 36 | Chef de Partie Sauce Award Winning Hertford **** |
37 | 37 | Pastry Chef AL**** ****AA Rosette Restaurant |
38 | 38 | QA Engineer |
39 | 39 | Documentation Engineer |
40 | 40 | Bilingual Customer Service Operator |
41 | 41 | Customer Event Coordinator (German speaking) |
42 | 42 | Senior Planner |
43 | 43 | Bilingual Reservationist (Customer Service) |
44 | 44 | Trampoline Coach Bushey Grove Leisure Centre |
45 | 45 | Self Employed Swimming Instructors |
46 | 46 | Self Employed Sport Coaches |
47 | 47 | Bar/Waiting Staff The Cricketers, Sarratt |
48 | 48 | Deputy Manager Nevill Crest and Gun, Eridge Green |
49 | 49 | Bar/Waiting Staff The Refectory, Godalming |
Let’s look at the most frequent titles. If several different companies use the same title, it’s much less likely to contain details specific to one job (like location, company information, or benefits).
titles = (
    df
    .groupby('Title')
    .agg(companies=('Company', 'nunique'), jobs=('Id', 'count'))
    .sort_values(['companies', 'jobs'], ascending=False)
)
len(titles)
196165
Only about 19% of the ad titles are used by more than one company.
(titles['companies'] > 1).mean()
0.1913440216144572
About 10% of the ad titles occur in 0 companies. This is likely because the Company field is empty for those ads, so pandas reads it in as NA and nunique does not count it. This is small enough that we can ignore it for this purpose.
(titles['companies'] == 0).mean()
0.10200086661738843
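A zero company count can arise because pandas' `nunique` skips missing values by default. The toy frame below (invented data, using the same column names as the competition files) shows a title whose only ad has no company ending up with a count of 0.

```python
import numpy as np
import pandas as pd

# Toy data: the 'Cleaner' ad has no company recorded.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Title": ["Chef", "Chef", "Cleaner"],
    "Company": ["A", "B", np.nan],
})

counts = toy.groupby("Title").agg(
    companies=("Company", "nunique"),  # NaN is not counted by default
    jobs=("Id", "count"),
)
print(counts)
```

Here “Cleaner” still has one job, but zero companies.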
Even with a cutoff of 2 companies there are still some odd titles here.
titles[titles.companies == 2]
Title | companies | jobs |
---|---|---|
Assistant Sales Manager Market Leading Retailer | 2 | 66 |
Vehicle Purchaser / Car Sales | 2 | 55 |
AREA RELIEF OFFICER | 2 | 53 |
Vehicle Technician MOT Tester | 2 | 42 |
Staff Nurse (RGN) Nursing Home | 2 | 33 |
... | ... | ... |
warehouse assistant | 2 | 2 |
warehouse operatives | 2 | 2 |
web designer | 2 | 2 |
yEAR ****/4 TEACHER CARLTON **** PER DAY | 2 | 2 |
zSeries Specialist zSeries UK Wide | 2 | 2 |
25416 rows × 2 columns
One reason is that the same job can come through two different job boards (SourceName), and they may represent the company name differently or make errors obtaining it.
For example, the Company value “hyphen” here looks like a mistake.
df[df.Title.str.startswith('zS')]
| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName | split |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49509 | 68626801 | zSeries Specialist zSeries UK Wide | zSeries Technical Specialist required for London, My high profile client (leading financial bran... | London | London | NaN | permanent | Spring Technology | IT Jobs | 32000.00 - 42000.00 GBP Annual | 37000.0 | jobserve.com | Train |
63044 | 68702465 | zSeries Specialist zSeries UK Wide | zSeries Technical Specialist required for London , My high profile client (leading financial bra... | City London South East | London | NaN | permanent | hyphen | IT Jobs | 32000 - 42000 per annum | 37000.0 | totaljobs.com | Train |
In the next example the company for the second job is ‘UKStaffsearch’, which is the name of the job board. The job board must have replaced the company name with its own.
Note that one is from the Train set and one from the Test set! This is a data leak.
df[df.Title.str.startswith('yEA')]
| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName | split |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
140597 | 70577243 | yEAR ****/4 TEACHER CARLTON **** PER DAY | Year ****/4 Teacher required for Mapperley Area TeacherActive are currently recruiting for a Pri... | Nottingham, Nottinghamshire, England, West Yorkshire | UK | NaN | contract | TeacherActive | Teaching Jobs | 93 - 140/day | 27960.0 | cv-library.co.uk | Train |
377237 | 71623608 | yEAR ****/4 TEACHER CARLTON **** PER DAY | Year ****/4 Teacher required for Mapperley Area TeacherActive are currently recruiting for a Pri... | Nottinghamshire - Nottingham | Nottingham | full_time | permanent | UKStaffsearch | HR & Recruitment Jobs | NaN | NaN | ukstaffsearch.com | Test |
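To get a rough sense of how widespread this kind of Train/Test overlap is, one could count ads whose title and description appear in more than one split. The sketch below runs on a toy frame with invented rows; on the real data the same lines would be applied to `df`. Exact matching is an assumption and only gives a lower bound, since reposted ads can differ by a stray space.

```python
import pandas as pd

# Toy frame with the same column names as the competition data.
toy = pd.DataFrame({
    "Title": ["Teacher", "Teacher", "Chef", "Chef"],
    "FullDescription": ["Year 3 teacher", "Year 3 teacher", "Sous chef", "Pastry chef"],
    "split": ["Train", "Test", "Train", "Test"],
})

# Number of distinct splits each (Title, FullDescription) pair appears in.
n_splits = toy.groupby(["Title", "FullDescription"])["split"].transform("nunique")
leaked = toy[n_splits > 1]
print(leaked)
```

Only the two “Teacher” rows are flagged, because the two “Chef” ads have different descriptions.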
Notice the double space in the job title.
These are all posted by the same company in multiple locations, but totaljobs.com has the company name as ‘Triple S Recruitment’ while cv-library.co.uk has it as ‘Triple S Recruitment Ltd’.
df[df.Title == ('Assistant Sales Manager Market Leading Retailer')].sort_values('Company')
| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName | split |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30332 | 68062445 | Assistant Sales Manager Market Leading Retailer | This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... | Bolton Lancashire North West | Bolton Le Sands | NaN | permanent | Triple S Recruitment | Sales Jobs | OTE 35-45k plus benefits | 40000.0 | totaljobs.com | Train |
227637 | 72444806 | Assistant Sales Manager Market Leading Retailer | This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... | Colne, Lancashire Lancashire North West | Colne | NaN | permanent | Triple S Recruitment | Sales Jobs | OTE 25- 30k plus benefits | 27500.0 | totaljobs.com | Train |
230526 | 72452426 | Assistant Sales Manager Market Leading Retailer | This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... | Stirling Stirlingshire Scotland | UK | NaN | permanent | Triple S Recruitment | Sales Jobs | OTE 30-35k plus benefits | 32500.0 | totaljobs.com | Train |
230527 | 72452429 | Assistant Sales Manager Market Leading Retailer | This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... | Brentford Middlesex South East | UK | NaN | permanent | Triple S Recruitment | Sales Jobs | OTE 35-40k plus benefits | 37500.0 | totaljobs.com | Train |
230936 | 72454431 | Assistant Sales Manager Market Leading Retailer | This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... | Dundee Angus Scotland | UK | NaN | permanent | Triple S Recruitment | Sales Jobs | OTE 35-45k plus benefits | 40000.0 | totaljobs.com | Train |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
206431 | 72120567 | Assistant Sales Manager Market Leading Retailer | The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... | Cambridge, Cambridgeshire | Cambridge | NaN | permanent | Triple S Recruitment Ltd | Retail Jobs | 15000 - 35000/annum OTE 30-35k plus benefits | 25000.0 | cv-library.co.uk | Train |
206432 | 72120572 | Assistant Sales Manager Market Leading Retailer | The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... | Llandudno, Wales | Llandudno | NaN | permanent | Triple S Recruitment Ltd | Retail Jobs | 15000 - 35000/annum OTE 30-35k plus benefits | 25000.0 | cv-library.co.uk | Train |
279043 | 72120569 | Assistant Sales Manager Market Leading Retailer | The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... | Cannock, Staffordshire | Cannock | NaN | permanent | Triple S Recruitment Ltd | Retail Jobs | NaN | NaN | cv-library.co.uk | Valid |
388642 | 72120555 | Assistant Sales Manager Market Leading Retailer | The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... | Stockton on Tees, North East | Stockton-On-Tees | NaN | permanent | Triple S Recruitment Ltd | Retail Jobs | NaN | NaN | cv-library.co.uk | Test |
206426 | 72120557 | Assistant Sales Manager Market Leading Retailer | The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... | Stirling, Scotland | Stirling | NaN | permanent | Triple S Recruitment Ltd | Retail Jobs | 15000 - 35000/annum OTE 30-35k plus benefits | 25000.0 | cv-library.co.uk | Train |
66 rows × 13 columns
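One possible way to reduce this kind of double counting is to normalise company names before counting them. The sketch below is an assumption about how that might look, not something done in this notebook: `normalize_company` and the suffix list are hypothetical, and names that differ by more than case, spacing, or a legal suffix would still be treated as distinct.

```python
import re

# Hypothetical list of trailing legal suffixes to drop; extend as needed.
LEGAL_SUFFIXES = re.compile(r"\b(ltd|limited|llp|plc)\.?$")


def normalize_company(name):
    """Lower-case, collapse whitespace and strip a trailing legal suffix."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return LEGAL_SUFFIXES.sub("", name).strip()


print(normalize_company("Triple S Recruitment Ltd"))
print(normalize_company("Triple S Recruitment"))
```

Both calls give the same string, so counting distinct normalised names would treat these ads as coming from one company.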
titles[titles.companies == 8]
Title | companies | jobs |
---|---|---|
GRADUATE SALES EXECUTIVE / GRADUATE ACCOUNT MANAGER | 8 | 110 |
Account Manager / Sales Executive | 8 | 58 |
Relief Support Worker | 8 | 41 |
LGV CE Driver | 8 | 39 |
English Teaching Assistant | 8 | 36 |
... | ... | ... |
Senior Data Analyst | 8 | 8 |
Senior Electrical Estimator | 8 | 8 |
Syndicate Accountant | 8 | 8 |
Telephone Researcher | 8 | 8 |
Web Content Editor | 8 | 8 |
288 rows × 2 columns
Even at 8 companies we still get some false positives.
These are all the same job ad!
df[df.Title == 'GRADUATE SALES EXECUTIVE / GRADUATE ACCOUNT MANAGER'].Company.value_counts()
BMS Sales Specialists LLP 27
BMS Graduate 16
BMS Graduates 15
London4Jobs 5
BMS GROUP 4
BMS Sales and Marketing Specialists 4
UKStaffsearch 2
BMS Graduate Recruitment 1
Name: Company, dtype: int64
We’ll set the cutoff at 10 companies; the data is reasonably clean there, and it captures roughly the top 1% of role titles (about 0.8%, or 1,611 titles).
(titles.companies >= 10).mean(), (titles.companies >= 10).sum()
(0.008212474192643949, 1611)
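To compare cutoffs side by side rather than one at a time, the share and count can be tabulated in a loop. The sketch below uses a small invented stand-in for the `titles` frame so it runs on its own; the numbers it prints are toy values, not results from the data.

```python
import pandas as pd

# Toy stand-in for the `titles` frame: one row per title with the number
# of distinct companies using it.
toy_titles = pd.DataFrame(
    {"companies": [0, 1, 1, 2, 2, 8, 10, 25]},
    index=[f"title {i}" for i in range(8)],
)

rows = []
for cutoff in [2, 8, 10]:
    keep = toy_titles["companies"] >= cutoff
    rows.append({"cutoff": cutoff, "share": keep.mean(), "n_titles": int(keep.sum())})

summary = pd.DataFrame(rows).set_index("cutoff")
print(summary)
```

Replacing `toy_titles` with `titles` would give the same table for the real data.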
Output into a CSV for further analysis in a spreadsheet program.
!mkdir -p output
titles[titles.companies >= 10].to_csv('output/common_titles.csv')